Abstract: Interlocus gene conversion is a process in which a DNA fragment of a certain length is 
non-reciprocally transferred (copied and pasted) between paralogous regions. There are two 
major approaches to understanding the rate and tract length of gene conversion. One is based 
on mutation-accumulation experiments, and the other uses natural DNA sequence variation. 
In this review, we overview the two major approaches and discuss their advantages and 
disadvantages. In addition, to demonstrate the importance of statistical analysis of empirical 
and evolutionary data for estimating tract length, we apply a maximum likelihood method to 
several data sets. 

1. Introduction 

Gene conversion is a recombinational process initiated by a double-strand break (DSB), through which 
a DNA fragment is non-reciprocally transferred (copied and pasted), generally between allelic regions 
(i.e., allelic gene conversion [1]). In addition, gene conversion occurs between paralogs when they have 
sufficient sequence homology (non-allelic or interlocus gene conversion) [2,3]. Although the mechanism 
is not yet fully understood (see Hastings [4] and Ling et al. [5] in this special issue), interlocus gene 
conversion is a major mutational process that occurs in both meiosis [2] and mitosis [6]. Basic questions 
on gene conversion include (i) What is the rate of gene conversion, and under what conditions? and (ii) What 
is the distribution of the tract length of gene conversion? Addressing these fundamental questions will 
provide great insight into how important a role gene conversion plays as a mutational mechanism. 

There are two potential approaches to estimate the rate and tract length of gene conversion. A 
straightforward empirical approach involves mutation (i.e., gene conversion) accumulation studies, in 
which the rate of gene conversion can be directly estimated. The other is an evolutionary approach, 
which utilizes DNA sequence data from multiple individuals, where the footprints of a number of gene 
conversion events are accumulated in the evolutionary history. The empirical approach provides more 
accurate estimates than the evolutionary approach, but the amounts of data are still limited because 
well-established experimental systems are available only for several model species including yeast and 
mouse. In contrast, the evolutionary approach can be readily applied to any species when sequence data 
for paralogs are available. 

In this article, we first review studies that estimated the rate and tract length of interlocus gene 
conversion by the two approaches, and discuss their advantages and disadvantages. Furthermore, we 
reanalyze empirical data from yeast and rodents to estimate the mean tract length by using a simple 
maximum likelihood (ML) method. We also apply the ML method to large-scale human genotyping 
data of disease genes, in which gene conversion is known to cause disease. Based on these results, we 
point out the importance of statistical analysis of empirical and evolutionary data. 

2. The Rate of Gene Conversion 

2.1. Empirical Approach 

Most transgenic systems for studying gene conversion use strains (or cell lines), in which a pair 
of genes are set up by transferring artificially edited DNA sequences (Figure 1), so that the nature 
of interlocus gene conversion can be investigated under an arbitrary condition [7]. The gene has a 
target marker site called "selected marker" (reversed triangles in Figure 1), and there is a trick that 
makes it possible to recognize if the selected marker is converted (e.g., in yeast, if a gene involved in 
nutrient requirement such as uracil or histidine is used, gene conversion induces prototroph formation). 
Therefore, it is easy to screen for strains that experienced gene conversions. Although Figure 1 illustrates 
a simple case, it is possible to set multiple selected markers. 

In the baker's yeast Saccharomyces cerevisiae, one of the model species of gene conversion 
studies, a number of studies have estimated the interlocus gene conversion rate under various 
conditions [8-15]. Estimates have a wide range from ~10^^° to ~10^'^ per cell division (we exclude 
classic works that reported estimates per culture). This great amount of variation between different 
studies arises largely because the gene conversion rate is affected by many factors. 

First of all, the rate in meiosis is different from that in mitosis. Jinks-Robertson and Petes [12] showed 
that the former is 300 times higher than the latter, and slightly lower estimates for the meiosis-mitosis 
ratio were reported later (~15 in [16] and ~100 in [17]). 

Second, genomic location matters. In general, the rate of intrachromosomal gene conversion may be 
higher than that of interchromosomal gene conversion. Lichten et al. [13,15] showed that the former is 
several times higher than the latter in both meiosis and mitosis (see also [14,18]). Lichten et al. [15] also 
investigated the positional effect on the gene conversion rate when two paralogs are located with variable 
distances between them. It was found that the gene conversion rate could be negatively correlated with 
the distance (see also [11]). It seems that the pattern may not be very different between meiosis and 
mitosis [13,15]. 

Figure 1. Illustration of a typical experiment to screen for gene conversion. See text 
for details. 

Third, the rate should correlate positively with the sequence identity between paralogs (i.e., negatively 
with sequence divergence). It is considered that gene conversion mainly occurs when the identity is more 
than 80% [19,20], but gene conversion can occur with identity <80%, although the rate is low [21]. Recent 
works clearly identified a positive correlation between sequence identity and the gene conversion rate [22-25]. 

Fourth, it seems that the length of the completely identical region has a significant effect on the rate of 
gene conversion. Ahn et al. [26] measured the rate with variable lengths of a recipient copy (from 
26 bp to 702 bp), and found that although the rate was extremely low, gene conversion occurred even 
with the shortest paralog (i.e., 26 bp), suggesting gene conversion requires identical regions as small as 
~10 bp (see also Mezard et al. [21] for a similar result). A later work by Jinks-Robertson and Petes [14] 
found that the rates for paralogs that have identical regions of >250 bp were much higher than the 
background rate. This work brought in the concept of the minimal efficient processing segment (MEPS), 
which was first introduced for homologous recombination in bacteria [27]. Jinks-Robertson and Petes 
[14] suggested that MEPS for yeast would be around 200 bp, which still serves as a good standard [7,28], 
but this does not necessarily rule out the possibility of gene conversion with paralogs that are shorter than 
MEPS [21,23,26,29,30]. 

The empirical approach has also been commonly applied to mouse (Mus musculus) and Chinese 
hamster (Cricetulus griseus). Extensive experiments yielded results quite similar to those for yeast. It 
seems that the rate in meiosis may be 100-1000 times higher than in mitosis [31]. A negative correlation 
between the gene conversion rate and the distance between paralogs was observed [32,33]. It was found that the 
rate is dramatically reduced for paralogs with identity ~80% in comparison with the rate for 100% 
identical pairs [34-36]. MEPS was estimated to be roughly 200 bp [35], which was repeatedly confirmed 
by follow-up experiments [37,38]. 

Thus, we overviewed experimental studies of yeast and rodents which explored the factors that affect 
the rate of gene conversion. These findings should be robust because each of them was demonstrated 
under a certain experimental condition. However, we found that it is difficult to compare the absolute 
gene conversion rates between different experiments, and this is why we avoided arguments based on absolute 
values. For example, some studies take advantage of specific sites where DSBs can be induced (e.g., 
the HO site in the MAT locus [39] and the target site of the I-SceI endonuclease [40]), and some do not. A 
technical problem is that, in a simple experimental design with a single selected marker, an estimated rate 
may include both gene conversion and unequal crossing-over because they have an identical outcome. 
Such a rate cannot be fairly compared with an estimate from a more sophisticated experiment, in which 
the two mechanisms can be distinguished, e.g., by using secondary markers. Furthermore, in vivo, 
the rate should largely depend on the genomic background, for example, the number of paralogs in 
the genome [11], specific motifs associated with recombination [41,42], and the activity of numerous enzymes 
involved in DNA repair and recombination (reviewed in [43]). Therefore, it is important to note that 
the available estimates may not be representative values of the gene conversion rate. 

2.2. Evolutionary Approach 

The pattern of polymorphism (e.g., SNPs) in duplicated regions is very informative because it should 
carry many footprints of gene conversion. This approach depends heavily on population genetic theory 
that predicts what pattern of polymorphism is expected under a given rate of gene 
conversion. When the process of gene conversion is modeled in the framework of population genetics, 
it is usually assumed that a gene conversion event can be initiated at a random position at a certain 
rate, g, and the elongation of the gene conversion tract occurs either in the 5' or 3' direction. It is 
commonly assumed that the elongation can be terminated at any position with a constant probability, 
say q, such that the tract length follows a geometric distribution with parameter q, or an exponential 
distribution with continuous approximation [44]. As the average tract length is T = 1/q, the per-site 
rate of gene conversion is defined as c = Tg, which is the probability that a particular site is involved 
in a gene conversion event per generation. According to population genetic theory [45-47], c can be 
well estimated from SNP data. From genome-wide SNP data in yeast [48], estimates of the ratio of the 
gene conversion rate c to the point mutation rate typically range from 10 to 100, which is in agreement 
with estimates for several duplicates in Drosophila melanogaster [49,50]. There are many locus-specific 
estimates of c for a wide range of species including malaria parasite [51], plant [52], avian [53] and 
human [54-56]. 
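
The relationship between the initiation rate, the termination probability and the per-site rate can be illustrated with a short simulation. The following is only a minimal sketch: the function names and the numerical values of the initiation rate and the termination probability are hypothetical and chosen for illustration, and the termination probability is written as q here to keep it distinct from the initiation rate g.

```python
import random


def mean_tract_length(q):
    """Mean of a geometric tract length with per-site termination probability q."""
    return 1.0 / q


def per_site_rate(g, q):
    """Per-site conversion rate c = T * g, where T = 1 / q is the mean tract length."""
    return mean_tract_length(q) * g


def sample_tract_length(q, rng):
    """Draw one tract length (at least 1 bp): extend site by site until termination."""
    length = 1
    while rng.random() >= q:
        length += 1
    return length


rng = random.Random(1)
q = 0.01   # hypothetical termination probability per site (T = 100 bp)
g = 1e-7   # hypothetical initiation rate per site per generation

lengths = [sample_tract_length(q, rng) for _ in range(20000)]
print(sum(lengths) / len(lengths))  # sample mean, expected to be close to T = 100
print(per_site_rate(g, q))          # per-site rate c = T * g
```

With these illustrative values, c is two orders of magnitude larger than g, because every site can be covered by tracts initiated at about T neighboring positions.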

A disadvantage of this approach is that estimates are model-dependent; therefore, they could be 
biased if the assumptions of the model do not hold. For example, as Innan's theory [45,46] assumes 
a constant-size population, an estimate may not be reliable if the population is growing or subdivided. 
This is a common problem shared by all estimates by population genetics-based methods, such as those 
of mutation rate and recombination rate. It is important to understand how population genetics-based 
estimates are quantitatively affected by violation of the assumptions. Coalescent simulations [46] would 
be one of the best ways to do this. 

3. The Tract Length of Gene Conversion 

3.1. Empirical Approach 

It is relatively straightforward to develop an empirical system for estimating gene conversion tract 
length by modifying the transgenic system for estimating the gene conversion rate. Figure 1 illustrates 
such a strategy, which has a selected marker at position 432 bp. In addition to the selected marker, 
there are several silent markers inserted in the donor gene. Because silent markers could be coconverted 
with the selected marker, it is possible to identify the converted tract in the marker space. Figure 1 also 
illustrates an example of a gene conversion event, which includes four markers from positions 409 to 
565. It can be inferred that the 5' break point lies between positions 304 and 409 and the 3' break 
point lies between positions 565 and 667. Therefore, the maximum and minimum lengths of 
this conversion tract are 362 bp (667 - 304 - 1) and 157 bp (565 - 409 + 1), respectively. Thus, the data 
for a number of detected gene conversions can be described by a simple matrix with the presence (+) or 
absence (-) of the markers in the recipient copy (Figure 1). Note that all data should have a positive 
sign (+) at the selected marker, which is indicated by the blue triangle. 
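
The bookkeeping for one row of the presence/absence matrix can be sketched as follows. This is a minimal illustration, not the procedure of any cited study: the function name is hypothetical, and because the text gives the positions of only some markers in Figure 1, the marker at position 500 below is an assumed placeholder for the fourth converted marker.

```python
def tract_bounds(positions, converted):
    """Return (min_len, max_len) in bp for one contiguous converted tract.

    positions: sorted marker positions (bp).
    converted: booleans, True where the marker was converted (+).
    max_len is None when the tract reaches the first or last marker,
    because its length is then unbounded within the marker set.
    """
    idx = [n for n, flag in enumerate(converted) if flag]
    if not idx:
        raise ValueError("no converted marker")
    first, last = idx[0], idx[-1]
    if idx != list(range(first, last + 1)):
        raise ValueError("discontinuous tract")
    # Shortest tract: from the first to the last converted marker, inclusive.
    min_len = positions[last] - positions[first] + 1
    if first == 0 or last == len(positions) - 1:
        return min_len, None
    # Longest tract: stops just short of the flanking unconverted markers.
    max_len = positions[last + 1] - positions[first - 1] - 1
    return min_len, max_len


# Markers at 304, 409, 432 (selected), 565 and 667 follow the text;
# position 500 is a hypothetical stand-in for the unspecified marker.
positions = [304, 409, 432, 500, 565, 667]
converted = [False, True, True, True, True, False]
print(tract_bounds(positions, converted))  # (157, 362)
```

Tracts that reach an end marker are returned with an unbounded maximum, which is the situation that motivates a statistical treatment of the data.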

Motivated by classic works [57,58], Ahn et al. [59] extensively investigated the tract length of gene 
conversion in yeast, and found that the average of the minimum lengths is about 500 bp. There are a number 
of follow-up studies, most of which reported an average of 200-300 bp [60-64] (note 
that most of them reported the mean of the maximum and minimum lengths). It seems that there is no 
large difference in the tract length of gene conversion between mitosis and meiosis [25]. Similar estimates 
(287-296 bp) have also been obtained for rodents (Chinese hamster) [63,64]. 

Thus, this experimental system has contributed to our understanding of the typical length of gene 
conversion. However, a problem is that the initiation and termination positions of the tract can usually 
be located only between markers, so that it is difficult to know the actual length. The problem is more serious when a tract 
covers all markers, so that the tract is potentially very long. Therefore, to understand the distribution 
of actual tract length, it is necessary to analyze the data statistically [65-69]. Here, we use a maximum 
likelihood (ML) method to estimate the mean tract length by assuming that the elongation of a gene 
conversion tract is terminated at a constant rate. This assumption should be reasonable because we 
found that the data of Taghian et al. [63] and others [59,61] exhibit a good fit to a geometric distribution, 
or an exponential distribution with continuous approximation (Figure 2). 

Table 1 summarizes data which are suitable to infer the mean tract length of interlocus gene conversion 
with our ML method. We collected data sets with the number of markers M ≥ 4 and the sample size 
n > 10. All data in Table 1 are for mitosis. For each detected gene conversion, the maximum and 
minimum lengths can be parsimoniously determined, and they are summarized in Figure 3. We excluded data that 
require two or more conversion events for a parsimonious explanation. Such data (called discontinuous tracts) arise at 
a low frequency [19,59-63]. Figure 3 shows that many tracts involve only the selected marker, indicating 
that there are a large number of short gene conversions. On the other hand, there are many tracts that could 
potentially be very long because they have positive signs at the left- and/or right-end markers. 



Figure 2. The proportion of coconverted silent markers as a function of distance from the 
selected marker. Data from Taghian et al. [63]. Horizontal axis: distance from selected marker (bp). 



Table 1. Summary of the data used for the ML analysis. 

Data Set             Donor / Recipient Genes (a)   Length of        Sequence  M (# of   n (Sample  Gene Conversion Rate (b)
                                                   Analyzed Region  Identity  Markers)  Size)

Saccharomyces cerevisiae (yeast)
Bailis et al. [19]   SAM2 (IV) / SAM1 (XII)        869 bp           83%       4         37         8.4 × 10^-9
Harris et al. [20]   PMA1 (VII) / PMA2 (XVI)       1830 bp          85%       24        13         5.0 × 10^-9
Cho et al. [61]      ura3 / ura3 (4.9 kb)          1130 bp          99%       11        86         1.6 × 10^-? (3.8 × 10^-?)
Palmer et al. [70]   ura3 / ura3 (4.9 kb)          54 bp            99%       5         49         3.3 × 10^-? (1.1 × 10^-?) (c)

Mus musculus (mouse)
Yang et al. [71]     HSV-1 tk / -2 tk «6.5 kb)     2500 bp          81%       14, 8 (d) 19         1.3 × 10^-? (2.3 × 10^-?)
Ruksc et al. [72]    C^tS' / C/i3' «8 kb)          1876 bp          99%       6         24         n.d.

Cricetulus griseus (Chinese hamster)
Taghian et al. [63]  MMTVneo / neo12 (3.8 kb)      989 bp           99%       13        32         < 4.8 × 10^-10 (2.7 × 10^-5) (e)
Kim et al. [64]      MMTVneo / neo12 (3.8 kb)      989 bp           99%       10        11         < 5.6 × 10^-10 (2.3 × 10^-5) (f)

(a) The chromosome numbers of the donor and recipient genes are shown in parentheses if they are located on different chromosomes. If they are on the same chromosome, the distance between them is shown. 
(b) The estimated rate from a strain with accelerated DSB is shown in parentheses. 
(c) Computed by assuming the doubling time is 3 h and the culturing time is 3 days. 
(d) We pooled two data sets with different numbers of markers, 14 and 8. 
(e) Computed by assuming the doubling time is 15 h and the culturing time is 14 days. 
(f) Computed by assuming the doubling time is 15 h and the culturing time is 12 days. 

In order to estimate the average length of gene conversion, T, from these data, we use a simple ML 
method. Our method is a slightly modified version of that of Gloor et al. [65] (see also [66]), so that it 
can be applied to data with multiple selected markers. Let D be the data set obtained from a single 
pair of genes, which consists of n identified tracts. Each tract is characterized by (i, j) when the tract 
includes the region from the ith to the jth markers (i ≤ j) but not the (i - 1)th or (j + 1)th markers. Note that i ≤ k ≤ j 
when the kth marker is the selected marker that is used for detecting gene conversion. Then, D can be 
simply described as 

$$D = \{(i_1, j_1), (i_2, j_2), \ldots, (i_n, j_n)\}$$

where $(i_l, j_l)$ is for the $l$th tract ($l = 1, 2, 3, \ldots, n$). M represents the total number of markers, and the 
position of the $i$th marker is denoted by $m_i$ ($i = 1, 2, 3, \ldots, M$). In addition, we define $m_0 = L_-$ and 
$m_{M+1} = L_+$, where $L_-$ and $L_+$ are the left and right ends of the duplicated region. We here assume 
$L_- \ll m_1$ and $L_+ \gg m_M$ (i.e., $L_- = -\infty$ and $L_+ = \infty$ for mathematical convenience), which makes 
the following computation much easier with a negligible effect on the ML estimate. 

To obtain the likelihood function of data D conditional on T, we consider three different models 
depending on the experimental design and the distribution of gene conversion tract. The first model 
(Model 1) can be applied to many experimental designs, in which DSBs are induced at a particular 
position (denoted by $x_k$) and the selected marker is essentially identical to the induced position. In 
such a case, it is straightforward to assume that the elongation of converted tracts in the two directions 
independently follows an exponential distribution (see above and Figure 2). Let $x_1$ and $x_2$ be the 5' and 
3' breakpoints of the tract. Then, the probability density that a tract that includes the selected marker extends from 
position $x_1$ to $x_2$ is given by a function of T: 

$$p_1(x_1, x_2 \mid T) = \frac{2}{T} e^{-2(x_k - x_1)/T} \times \frac{2}{T} e^{-2(x_2 - x_k)/T} \qquad (1)$$

Then, because the focal tract has to satisfy the two conditions, $m_{i-1} < x_1 < m_i$ and $m_j < x_2 < m_{j+1}$, 
the probability that a conversion tract is (i, j) is given by 

$$\mathrm{Prob}_1(i, j \mid T) = \int_{m_{i-1}}^{m_i} \int_{m_j}^{m_{j+1}} p_1(x_1, x_2 \mid T)\, dx_2\, dx_1 \qquad (2)$$

Then, the likelihood of the data D is given by 

$$L_1(T \mid D) = \prod_{l=1}^{n} \mathrm{Prob}_1(i_l, j_l \mid T). \qquad (3)$$
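
The likelihood for Model 1 can be evaluated in closed form, because the two integrals in Equation (2) separate. The sketch below is an illustration under stated assumptions rather than the implementation used for Figure 3: it assumes that each arm of the tract is exponential with mean T/2 (so that the whole tract has mean T), that the DSB sits exactly at the selected marker, and that a coarse grid search is an adequate optimizer. The function names, the marker positions and the tract data are all hypothetical.

```python
import math

NEG_INF, POS_INF = float("-inf"), float("inf")


def pad_markers(positions):
    """Return [m_0, m_1, ..., m_M, m_{M+1}] with m_0 = -inf and m_{M+1} = +inf."""
    return [NEG_INF] + list(positions) + [POS_INF]


def prob_model1(i, j, k, m, T):
    """Equation (2) in closed form; markers are 1-indexed, DSB at x_k = m[k]."""
    r = 2.0 / T  # each arm is assumed exponential with mean T / 2
    xk = m[k]
    left = math.exp(-r * (xk - m[i])) - math.exp(-r * (xk - m[i - 1]))
    right = math.exp(-r * (m[j] - xk)) - math.exp(-r * (m[j + 1] - xk))
    return left * right


def log_likelihood_model1(tracts, k, m, T):
    """Logarithm of Equation (3) for a list of (i, j) tracts."""
    return sum(math.log(prob_model1(i, j, k, m, T)) for i, j in tracts)


def ml_estimate_model1(tracts, k, m, grid):
    """Crude ML estimate: the grid value of T with the highest log-likelihood."""
    return max(grid, key=lambda T: log_likelihood_model1(tracts, k, m, T))


# Hypothetical data: five markers, selected marker (and DSB) at the third one.
m = pad_markers([100, 200, 300, 400, 500])
k = 3
tracts = [(3, 3)] * 6 + [(2, 3)] * 2 + [(3, 4)] * 2 + [(2, 4)] + [(1, 5)]
T_hat = ml_estimate_model1(tracts, k, m, grid=range(10, 2001, 5))
print(T_hat)
```

Because the outermost intervals extend to infinity, tracts that cover the first or last marker contribute a finite probability, which is how potentially very long tracts enter the estimate.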

In other cases, the location of DSBs should be treated as unknown. If we assume the exponential 
elongation of a tract from a DSB in the two directions independently as defined in (1), then the total 
length follows a gamma distribution with shape parameter two. In this Model 2, we assume this gamma 
distribution for the length of an individual tract (t): 

$$p_2(t) = \frac{4t}{T^2} e^{-2t/T} \qquad (4)$$

and $\mathrm{Prob}_2(i, j \mid T)$, the probability that a conversion tract is (i, j) conditional on a conversion 
event that involves at least one marker, is given by 

$$\mathrm{Prob}_2(i, j \mid T) = \frac{\int_{m_{i-1}}^{m_i} \int_{m_j}^{m_{j+1}} p_2(x_2 - x_1)\, dx_2\, dx_1}{\sum_{(i,j) \in A} \int_{m_{i-1}}^{m_i} \int_{m_j}^{m_{j+1}} p_2(x_2 - x_1)\, dx_2\, dx_1} \qquad (5)$$
where A consists of all possible pairs (i, j) that satisfy the condition i ≤ j. 

Figure 3. Application of the ML method to the eight data sets in Table 1. For each data set, 
the minimum and maximum lengths of identified fragments are shown in the left panel. The 
minimum length is given by a filled box with the maximum length by a bar. Filled triangles 
indicate the positions of the selected markers. The right panel shows the log-likelihood curve 
and the ML estimate of T with the 95% confidence interval. 

From (5), we can obtain the probability that a gene conversion event is captured in the experimental 
system. In other words, we are interested in the probability that the tract includes the selected marker (i.e., 
the kth marker). Assume that B consists of all possible pairs (i, j) that satisfy the condition i ≤ k ≤ j. 
Then, this probability is given by 

$$\mathrm{Prob}_2(B \mid T) = \sum_{(i,j) \in B} \mathrm{Prob}_2(i, j \mid T) \qquad (6)$$

From (5) and (6), we have the probability that the detected tract is (i, j): 

$$\mathrm{Prob}_2(i, j \mid B, T) = \frac{\mathrm{Prob}_2(i, j \mid T)}{\mathrm{Prob}_2(B \mid T)} \qquad (7)$$

and the likelihood of the entire data D is given by 

$$L_2(T \mid D) = \prod_{l=1}^{n} \frac{\mathrm{Prob}_2(i_l, j_l \mid T)}{\mathrm{Prob}_2(B \mid T)}. \qquad (8)$$

In addition, we consider Model 3, in which the entire length of a tract follows an exponential 
distribution: 

$$p_3(t) = \frac{1}{T} e^{-t/T} \qquad (9)$$

This is a frequently used assumption in evolutionary models, as introduced in the "Evolutionary 
Approach" section. For this Model 3, the likelihood function of D is simply obtained by replacing 
$p_2(t)$ with $p_3(t)$ in (5). 
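
The likelihoods for Models 2 and 3 can also be evaluated without numerical integration. The sketch below is an illustration under stated assumptions, not the code behind Figure 3: it assumes that tracts are initiated uniformly along the sequence, so that the double integral in (5) reduces to differences of the integrated survival function of the tract length, and it uses the fact that the normalizing sum over A cancels in the ratio (7). The function names, marker positions, tract data and grid are hypothetical.

```python
import math


def tail_integral(u, T, model):
    """Integral from u to infinity of the survival function of the tract length."""
    if math.isinf(u):
        return 0.0
    if model == 2:  # gamma distribution with shape 2 and mean T, Equation (4)
        return (u + T) * math.exp(-2.0 * u / T)
    if model == 3:  # exponential distribution with mean T, Equation (9)
        return T * math.exp(-u / T)
    raise ValueError("model must be 2 or 3")


def tract_integral(i, j, m, T, model):
    """Double integral in the numerator of Equation (5) for tract (i, j)."""
    a, b, c, d = m[i - 1], m[i], m[j], m[j + 1]

    def G(u):
        return tail_integral(u, T, model)

    return G(c - b) - G(c - a) - G(d - b) + G(d - a)


def prob_given_detected(i, j, k, m, T, model):
    """Equation (7): probability of tract (i, j) given that marker k is converted."""
    M = len(m) - 2  # m is padded with m_0 = -inf and m_{M+1} = +inf
    total = sum(tract_integral(a, b, m, T, model)
                for a in range(1, k + 1) for b in range(k, M + 1))
    return tract_integral(i, j, m, T, model) / total


def log_likelihood(tracts, k, m, T, model):
    """Logarithm of Equation (8), with p_2 or p_3 as the tract-length density."""
    return sum(math.log(prob_given_detected(i, j, k, m, T, model))
               for i, j in tracts)


# Hypothetical data: five markers, selected marker at the third one.
m = [float("-inf"), 100, 200, 300, 400, 500, float("inf")]
k = 3
tracts = [(3, 3)] * 6 + [(2, 3)] * 2 + [(3, 4)] * 2 + [(2, 4)] + [(1, 5)]
grid = range(10, 2001, 5)
for model in (2, 3):
    T_hat = max(grid, key=lambda T: log_likelihood(tracts, k, m, T, model))
    print(model, T_hat)
```

Running both models on the same data gives a quick check of how sensitive the estimate of T is to the assumed tract-length distribution.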

Using these likelihood functions, we estimated T for each data set in Table 1, and the results are shown 
in Figure 3. If the data set has a selected marker at the position of the induced DSB, Model 1 is applied; 
otherwise both Models 2 and 3 are used. ML estimates based on these three models are denoted by $T_1$, 
$T_2$ and $T_3$. It was found that our method provides ML estimates of the mean tract length with relatively 
narrow confidence intervals. The results of Models 2 and 3 are not very different from each other. For the 
four data sets of yeast, the estimates are within a quite small range from ~50 to several hundred bp, 
which seems to be much shorter than for allelic gene conversion [73]. The results for rodents are similar; 
our estimates are around 100 bp except for the data of Ruksc et al. [72], which provide an estimate 
of >1 kb with a much wider confidence interval (roughly 1 kb) than the others. This may be partly 
because the marker density is very low (the average interval is 375 bp for this data set, while the average 
of the others is ~160 bp). 

Thus, the ML method makes it possible to estimate the mean tract length with a relatively small amount 
of data. This demonstrates that the empirical approach combined with statistical analysis is a powerful means to 
understand the tract length of gene conversion. 

3.2. Evolutionary Approach 

DNA sequence data potentially include information on the tract lengths of gene conversion events 
that occurred in their ancestral lineages. GENECONV is a program developed by Sawyer [74] to detect 
converted regions in aligned DNA sequences. GENECONV analyzes an alignment of multiple sequences 
in a pairwise manner, and identifies unusually long regions of high identity between the focal pair, which 
are candidates for gene conversion. The algorithm involves statistical treatment, which conditions on the 
pattern of variable sites in the other sequences in the alignment. The statistical significance is determined 
by random shuffling of variable sites in the alignment. 

As was demonstrated by our recent simulation work [75], it is not appropriate to use GENECONV 
to infer the actual tract length. There is no doubt that the regions identified by GENECONV are strong 
candidate regions that have undergone recent gene conversion. However, the identified region does not 
necessarily correspond to the region that was really transferred by a single gene conversion event. 
Nevertheless, the result of GENECONV is sometimes misinterpreted as if the output (a list of candidate 
converted tracts) reflects the distribution of the tract length of gene conversion (e.g., Ref. [76]). This 
effect is easily demonstrated by the simple illustrations in Figure 4. In the left panel, two conversions in 
opposite directions share a part of the tract. If GENECONV is applied to the sequence data in the box, 
it will likely identify two regions (with red lines in Figure 4) with lengths much shorter than the real 
converted tract lengths. In contrast, the two overlapping conversion events in the same direction result 
in a large region identified by GENECONV. The real situation should be much more complicated with 
a number of overlapping recurrent gene conversion events, indicating that the empirical approach would 
be the only reliable means to investigate the tract length of gene conversion. It should be noted that there 
are several algorithms for detecting gene conversion tracts [77-81], and we here treated GENECONV 
as a representative of them because they share the basic idea. Thus, natural variation in DNA 
sequences is not very suitable for investigating the tract length of gene conversion. 

Figure 4. Illustration of the effect of multiple gene conversions on the performance of 
GENECONV. See text for details. 

However, there are special cases where we can obtain high-quality data. One example is disease 
genes in humans. There are a number of human diseases caused by gene conversion, as reviewed by 
Chen et al. [41] in this issue (see also [3]). In many cases, diseases are caused by gene conversions 
that transfer a mutation from a pseudogenized duplicate to the functional copy, resulting in a missense 
mutation [82-84], a nonsense mutation [85], a frameshift [86,87], or a change of splicing site [88]. These 
conversions occur between highly similar sequences (more than 90% identity, see [3]), which seems to be slightly 
higher than that for yeast and rodents. When the sequences of the functional and pseudogenized copies 
are known, the positions of causal mutations can be identified. For such cases, there are some interesting 
follow-up studies in which a region encompassing the causal mutations is resequenced or genotyped for a 
number of patients. The ML method can be directly applied to such a data set to estimate the average 
tract length under one condition, that is, each of the detected gene conversions was created by a single 
gene conversion event. This assumption may be reasonable for serious diseases, for which the causal 
mutation cannot increase in frequency in the human population. If so, almost all mutations should 
be eliminated from the population in a very short time, during which it is quite unlikely that another 
gene conversion occurs in the focal region. As a consequence, the identified gene conversion tracts in 
sequence data are most likely created by a single independent gene conversion. It should be noted that 
the gene conversions analyzed here are those that occurred in meiosis, while the data in Table 1 are 
for mitosis. 

In Table 2, we summarize four genotyping data sets from human disease genes, for which our 
assumption of independent gene conversion should likely hold because the frequencies of patients are 
extremely low. The ML function (8) was applied to these four data sets and the results are summarized in 
Figure 5. The ML estimates of T are quite similar to those of the empirical approach for yeast and rodent. 



Table 2. Summary of the data of human disease genes used for the ML analysis. 

Data Set              Disease                Recipient Gene (a)         Analyzed Region          Sequence  n (Sample  Frequency of Patients (b)
                                                                                                 Identity  Size)

Gupta et al. [89,90]  von Willebrand         vWF                        intron 27 and            97%       13         < 1/500 (type 2M)
                      disease types 2M & 3   (12p13.3/22q11.22-q11.23)  exon 28                                       1/500,000 (type 3)
Friaes et al. [91]    congenital adrenal     CYP21A2                    exons 1-10 and           96-98%    92         1/15,500-1/280
                      hyperplasia            (6p21.3, 30 kb)            flanking regions
Tayebi et al. [92]    Gaucher disease        GBA (1q21, 16 kb)          exons 3-11               96%       34         1/1,000,000-1/850
Nicolis et al. [93]   Shwachman-             SBDS (7q11, 305 kb)        exons 1-5 and            97%       25         1/100,000
Boocock et al. [88]   Diamond syndrome                                  junction of exon/intron            235

(a) The chromosomal positions of the donor and recipient genes are shown in parentheses if they are located on different chromosomes. If they are on the same chromosome, the position of the recipient gene and the distance between the two genes are shown. Note that all donor genes are pseudogenes of each functional gene. 
(b) This frequency includes patients whose disease is not caused by gene conversion. 



Figure 5. Application of the ML method to the four data sets in Table 2. See the legend of 
Figure 3 for details. 

4. Conclusions and Perspectives 

Like other mutational mechanisms, interlocus gene conversion can create changes in DNA 
sequences, which can potentially increase or decrease the fitness of the host individual. When 
the fitness impact is strong, the conversion should be immediately selected for or against. To understand 
the role gene conversion plays in an organism's life and evolution, it is crucial to know the rate and tract 
length of gene conversion. 

In this review, we first overviewed the empirical and evolutionary approaches to estimate the rate 
of gene conversion between duplicated regions. The empirical approach generally involves mutation 
accumulation studies. Most studies use strains or cell lines, to which donor and/or recipient genes are 
transferred. This transgenic system, which is well established in yeast and rodents, is flexible enough 
to explore the rate under variable conditions. It has been demonstrated that the rate is determined by 
many factors including the genomic location and nucleotide identity between duplicates. Evolutionary 
approaches may be suitable to estimate the rate for non-model species, because they can be applied 
whenever DNA sequence data are available. Estimates based on polymorphism (SNP) data are roughly in 
agreement with those of the empirical approach. 

Inferring tract lengths is more complicated in both approaches. The empirical approach 
uses the transgenic system for estimating the rate with a modification: a number of markers are 
distributed in the donor (or recipient) sequences, which make it possible to trap conversion tracts. As this 
system only allows one to locate the initiation and termination positions of the tract between markers, it 
is difficult to know the actual length. Therefore, we here used a simple ML method to estimate the mean 
tract length. The method estimated the mean lengths with relatively narrow confidence intervals for 
many data sets, indicating the importance of statistical analysis of empirical data. Most estimates range 
from 50 to several hundred bp. 

In contrast, evolutionary data are not very informative for the tract length mostly because evolutionary 
data accumulate a number of footprints of historical gene conversions that potentially overlap with one 
another. Exceptions include genotyping data of human disease genes, at which gene conversion causes 
serious disease when it transfers a deleterious mutation from a pseudogenized duplicate. With this 
condition, if we have a sample of patients that are not genetically related, most of the detected gene 
conversions in the gene should be very young and independent. Our ML method was successfully 
applied to such data. 

The rate of gene conversion we considered in this review is a per-site rate, that is, the rate at which 
a particular site is involved in a gene conversion event. This rate is different from the rate at which a gene 
conversion event is initiated. Under the model we introduced above, the former is denoted by c and the 
latter by g. The latter can be estimated if we know the average tract length, T, because g is simply given by c/T. 
According to our ML analysis, T would be on the order of 10^1 to 10^2 bp, so that the initiation rate of gene 
conversion per site would be one or two orders of magnitude lower than c. 

Thus, our knowledge of the rate and tract length of interlocus gene conversion is growing. However, 
data are still limited to several model systems. Because the rate should be highly variable across the 
genome, understanding the genomic landscape of gene conversion should be one of the future directions. 
It has been thought that many biological features are shared by gene conversion and recombination, 
and the recombination rate is also highly variable across the genome. There are a number of hotspots 
of recombination in genomes [73,94-97], and some of them would also be associated with interlocus 
gene conversion [98-100], although the amount of information is still limited. To fully understand 
the mechanisms behind the great variability of the gene conversion and recombination rates, we need 
much more data than are currently available, part of which may be obtained by taking advantage of 
next-generation sequencing. 